Data Sources and Overview

The source of the dataset comes from R for Data Science’s Tidy Tuesday Project, which provides weekly releases of raw datasets for users to wrangle and analyze. The data itself originates from FiveThirtyEight, containing movies ranging from 1970 to 2013, and merges data from several sources:

  • IMDB: Provides information about each film, such as year, runtime, genre, language, rating.
  • BechdelTest.com: Provides year, title, and a Bechdel Test score ranging from 0-3.
  • The-Numbers.com: Provides information about budget and revenue of each movie.

Data Import and Cleaning

raw_bechdel = read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-09/raw_bechdel.csv')

movies = read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-09/movies.csv')

The dataset was distributed in two files, movies.csv, raw_bechdel.csv.

  • movies.csv: A data file of 1794 observations and 34 variables, which includes movies from 1970 to 2013, and contains variables for title, IMDB ID, year, Bechdel test (uncleaned, cleaned, binary pass/fail), genres, ratings, budget, domestic and international gross revenue, and 2013-adjusted budget and revenues.
  • raw_bechdel.csv: A data file of 8839 observations and 5 variables, containing movies from 1888 to 2021, and contains variables for title, movie ID, IMDB ID, and raw Bechdel test score (0 to 3).

Data cleaning

The data cleaning steps involve the following: - Converting character to numeric variables for domgross, intgross, domgross_2013, intgross_2013

  • Converting the binary character variable to a logical variable that indicates whether the movie passes or fails the Bechdel Test

  • Recoding and relevelling the numeric decade_code variable into a decade factor variable according to the year of release

  • Recoding and relevelling the clean_test variable into 4 levels corresponding to the Bechdel Test criteria:

    • 0 = Less than 2 women
    • 1 = Women don’t talk to each other
    • 2 = Women only talk about men
    • 3 = Dubious (uncertain whether it passes Bechdel Test)
    • 4 = Passes Bechdel Test
  • Creating variables for profit and ROI for each movie. For this, we will use worldwide box office revenue instead of domestic revenue.

    • profit = intgross_2013 – budget_2013
    • ROI = profit / budget_2013
  • Renaming key variables to relevant names

movies = movies %>% 
  mutate(test = as.factor(test), 
         clean_test = fct_recode(clean_test, 
                                 "Less than 2 women" = "nowomen", 
                                 "Don't talk to each other" = "notalk",
                                 "Only talk about men" = "men", 
                                 "Dubious" = "dubious",
                                 "Passes Bechdel" = "ok"),
         clean_test = fct_relevel(clean_test, c("Less than 2 women", "Don't talk to each other", 
                                                "Only talk about men", "Dubious", "Passes Bechdel")),
         binary = ifelse(binary == "PASS", TRUE, FALSE), 
         domgross = as.numeric(domgross),
         intgross = as.numeric(intgross),
         domgross_2013 = as.numeric(domgross_2013),
         intgross_2013 = as.numeric(intgross_2013),
         decade_code = case_when(year >= 1970 & year < 1980 ~ "1970-1979",
                            year >= 1980 & year < 1990 ~ "1980-1989",
                            year >= 1990 & year < 2000 ~ "1990-1999",
                            year >= 2000 & year < 2010 ~ "2000-2009",
                            year >= 2010 & year < 2020 ~ "2010 - present"),
         decade_code = as.factor(decade_code),
         title = str_replace(title, "&#39;", "'"),
         title = str_replace(title, "&amp;", "&"),
         title = str_replace(title, "&agrave;", "à"),
         title = str_replace(title, "&aring;", "å"),
         title = str_replace(title, "&auml;", "ä"), 
         profit = intgross_2013 - budget_2013,
         ROI = profit/budget_2013) %>% 
  rename("pass_bechdel" = binary,
         "bechdel_score" = clean_test,
         "decade" = decade_code) %>% 
  select(year, title, bechdel_score, pass_bechdel, budget_2013:intgross_2013, decade, imdb_id, language, metascore, imdb_rating, genre:runtime, profit, ROI)

head(movies) %>% 
  kableExtra::kbl() %>% 
  kableExtra::kable_paper("striped", "hover", full_width = F) %>% 
  kableExtra::scroll_box(width = "100%", height = "300px")
year title bechdel_score pass_bechdel budget_2013 domgross_2013 intgross_2013 decade imdb_id language metascore imdb_rating genre awards runtime profit ROI
2013 21 & Over Don’t talk to each other FALSE 13000000 25682380 42195766 2010 - present 1711425 NA NA NA NA NA NA 29195766 2.2458282
2012 Dredd 3D Passes Bechdel TRUE 45658735 13611086 41467257 2010 - present 1343727 NA NA NA NA NA NA -4191478 -0.0918001
2013 12 Years a Slave Don’t talk to each other FALSE 20000000 53107035 158607035 2010 - present 2024544 English 97 8.3 Biography, Drama, History Won 3 Oscars. Another 131 wins & 137 nominations. 134 min 138607035 6.9303517
2013 2 Guns Don’t talk to each other FALSE 61000000 75612460 132493015 2010 - present 1272878 English, Spanish 55 6.8 Action, Comedy, Crime 1 win. 109 min 71493015 1.1720166
2013 42 Only talk about men FALSE 40000000 95020213 95020213 2010 - present 0453562 English 62 7.6 Biography, Drama, Sport 3 wins & 13 nominations. 128 min 55020213 1.3755053
2013 47 Ronin Only talk about men FALSE 225000000 38362475 145803842 2010 - present 1335975 English, Japanese 29 6.6 Action, Adventure, Fantasy 1 nomination. 118 min -79196158 -0.3519829

Our resulting dataset contains 1794 observations and 17 variables, indicating information about each movie’s Bechdel Test score, budget, revenue, genre, and ratings.

Exploratory Data Analysis

Distribution of Bechdel Test Scores

movies %>% 
  group_by(bechdel_score) %>% 
  summarise(N = n()) %>% 
  mutate(Proportion = N/sum(N)) %>% 
  rename("Bechdel Test Criterion" = bechdel_score) %>% 
  knitr::kable(digits = 3) %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
Bechdel Test Criterion N Proportion
Less than 2 women 141 0.079
Don’t talk to each other 514 0.287
Only talk about men 194 0.108
Dubious 142 0.079
Passes Bechdel 803 0.448

Bechdel Test Scores by decade

less2_df = movies %>% 
  filter(bechdel_score == "Less than 2 women") %>% 
  group_by(decade, bechdel_score) %>% 
  summarise(n = n()) 

notalk_df = movies %>% 
  filter(bechdel_score == "Don't talk to each other") %>% 
  group_by(decade, bechdel_score) %>% 
  summarise(n = n()) 

talkmen_df = movies %>% 
  filter(bechdel_score == "Only talk about men") %>% 
  group_by(decade, bechdel_score) %>% 
  summarise(n = n()) 

dubious_df = movies %>% 
  filter(bechdel_score == "Dubious") %>% 
  group_by(decade, bechdel_score) %>% 
  summarise(n = n()) 

pass_df = movies %>% 
  filter(bechdel_score == "Passes Bechdel") %>% 
  group_by(decade, bechdel_score) %>% 
  summarise(n = n()) 

table = bind_cols(less2_df, notalk_df, talkmen_df, dubious_df, pass_df)

table %>% 
  plot_ly(x = ~decade...1, y = ~n...3, type = "bar", name = "Less than 2 women",
          marker = list(color = "darkred")) %>% 
  add_trace(y = ~n...6, name = "Don't talk to each other", marker = list(color = "red")) %>% 
  add_trace(y = ~n...9, name = "Only talk about men", marker = list(color = "darkorange")) %>% 
  add_trace(y = ~n...12, name = "Dubious", marker = list(color = "yellow")) %>% 
  add_trace(y = ~n...15, name = "Passes Bechdel", marker = list(color = "lightgreen")) %>% 
  layout(barmode = "stack",
         xaxis = list(title = "Decade"),
         yaxis = list(title = "Count"))

Bechdel Test Scores by genre

Bechdel Test by IMDB Rating

movies %>% 
  plot_ly(y = ~imdb_rating, x = ~bechdel_score, type = "scatter", 
          mode = "markers", marker = list(color = ~imdb_rating)) %>% 
  layout(yaxis = list(title = list(text = "IMDB Rating", standoff = 5), tickfont = list(size = 10), gridcolor = "white"),
         xaxis = list(title = "Bechdel Criterion"), tickfont = list(size = 10), gridcolor = "gray")

Distribution of budgets according to Bechdel scores

Next, we want to explore how movie budgets may differ according to the primacy of women’s roles in movies. Grouping by Bechdel score, we can compute the median budget, adjusted to 2013 inflation.

movies %>% 
  group_by(bechdel_score) %>% 
  summarise(median_budget = median(budget_2013)) %>% 
  plot_ly(x = ~median_budget, y = ~bechdel_score, type = "bar", color = ~bechdel_score, colors = "YlGn") %>% 
  layout(yaxis = list(title = "Bechdel Criterion", standoff = 10),
         xaxis = list(title = "Median movie budget ($)"), 
         legend = list(reverse = TRUE))
movies %>% 
  plot_ly(x = ~bechdel_score, y = ~budget_2013, type = "box", text = ~title) %>% 
  layout(yaxis = list(title = "Movie budget ($)", standoff = 10),
         xaxis = list(title = "Bechdel Criterion" ), 
         legend = list(reverse = TRUE))

We can visualize the median budgets with a bar chart to see that movies featuring two women who don’t talk to each other appear to have much larger budgets than the rest. Movies that pass the Bechdel test also appear to have slightly smaller budgets than movies that don’t pass.

Next steps…

  • F-test (ANOVA) of differences in budget
budget_dist_p = movies %>% ggplot(aes(x = budget_2013)) + geom_histogram(alpha = 0.8, color = "white") + 
  labs(
    x = "Count",
    y = "Budget ($, 2013-adjusted)",
    title = "Distribution of movie budgets")

ggplotly(budget_dist_p)

It appears that budget is heavily right-skewed. For this, we will need to run a Kruskal-Wallace test in place of an F-test. Below are the results:

kruskal.test(budget_2013 ~ bechdel_score, data = movies) %>% 
  broom::tidy() %>% 
  rename("Test statistic" = statistic,
         "p-value" = p.value, 
         "Parameter (df)" = parameter,
         "Method" = method) %>% 
  kableExtra::kbl() %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>%
  kableExtra::kable_styling(font_size = 12)
Test statistic p-value Parameter (df) Method
57.04459 0 4 Kruskal-Wallis rank sum test